The Nobel Prize is perhaps one of the most widely recognized scientific awards. Besides the honour, the prestige, and the substantial prize money that come with winning a Nobel Prize, winners also receive a gold medal showing Alfred Nobel (1833 - 1896), who established the prize. Each Nobel Prize thus consists of a gold medal, a diploma bearing a citation, and a sum of money, the amount of which depends on the income of the Nobel Foundation.
Alfred Nobel's will of 1895 established five separate prizes, awarded to "those who, during the preceding year, have conferred the greatest benefit to mankind". Alfred Nobel was a Swedish chemist, engineer, and industrialist best known for the invention of dynamite. The prizes are given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, and peace; a sixth prize, in economics, was added later and first awarded in 1969. The Nobel Prizes have been awarded since 1901.
A Nobel laureate is a recipient of the Nobel Prize. The award is given annually for outstanding achievement in the fields of physics, chemistry, medicine or physiology, literature, and economics, and for the promotion of peace. It is widely considered one of the most prestigious awards in these fields. At the beginning of October, the Nobel Committee chooses the Nobel Peace Prize laureates through a majority vote. The decision is final and without appeal. The names of the Nobel Peace Prize laureates are then announced. In December, the Nobel Prize laureates receive their prizes.
Here we load the libraries used in this analysis and then read in our data set. The data comes from the Nobel Foundation, which has made available a dataset of all prize winners from the start of the prize, in 1901, to 2016.
# Loading in required libraries
import pandas as pd # for data cleaning and analysis
import seaborn as sns # for making statistical graphics
import numpy as np # for working with ndarray
import matplotlib.pyplot as plt # for plotting
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Reading in the Nobel Prize data
nobel = pd.read_csv('datasets/nobel.csv')
Here we shall look at the structure of the data and see if we need to change the variable types or need to create/remove additional variables.
# Structure of Data Set
nobel.info()
We have 18 variables in this data set, including year, category, prize, motivation, sex, and others. We shall look at the distributions of these variables individually and in pairs, to find any relationships that could exist between variables.
Note the following description of some of the variables:
We note that some of the variables are not stored as correct data types. For example, we shall change the death_date and birth_date variables to date variables.
# Convert columns birth_date and death_date to datetime
nobel[["birth_date"]] = nobel[["birth_date"]].apply(pd.to_datetime)
nobel[["death_date"]] = nobel[["death_date"]].apply(pd.to_datetime)
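As a minimal sketch of the conversion above, the toy frame below shows how text columns become datetime columns. Its three rows are typed in by hand for illustration and are not read from nobel.csv. Passing errors="coerce" is an optional choice that the original code does not make: it turns unparseable entries into NaT instead of raising an error.

```python
import pandas as pd

# Toy rows standing in for the real file; typed in for illustration only
toy = pd.DataFrame({
    "birth_date": ["1867-11-07", "1908-05-23", None],
    "death_date": ["1934-07-04", "1991-01-30", "not recorded"],
})

# errors="coerce" (an assumption, not part of the original code)
# maps values that cannot be parsed to NaT rather than raising
for col in ["birth_date", "death_date"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

print(toy.dtypes)        # both columns are now datetime columns
print(toy.isna().sum())  # number of missing or unparseable dates per column
```

Checking the missing-value counts after coercion is a quick way to see how many dates could not be parsed.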
# Taking a look at data set
nobel.head(n=6)
# Summary of Numeric Variables
nobel.describe().transpose()
Our data set covers 1901 (the first year of the prize) through 2016, as indicated by the range of year in the describe() output. As highlighted earlier, laureate_id is simply a unique identifier for the winners and has no inherent value for the analysis.
nobel.category.value_counts().plot(kind = 'bar', figsize = (15,5))
plt.xlabel("Category of Prize", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Category", y=1.02);
Most prizes are awarded to individuals in the field of medicine, followed closely by physics. Economics has received the lowest number of prizes over the award's existence; however, it is worth noting that the economics Nobel Prize was only introduced in 1969. The specific counts per category are given below.
# Category Counts
counts_category = nobel.category.value_counts()
counts_category
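To compare categories on a common scale, the counts can also be expressed as shares. The sketch below is only an illustration: the category labels are a hypothetical hand-made sample, not the real counts, and value_counts(normalize=True) is used to obtain fractions that sum to 1.

```python
import pandas as pd

# Hypothetical sample of category labels, not the real data
categories = pd.Series(
    ["Medicine", "Medicine", "Physics", "Physics", "Chemistry", "Economics"],
    name="category",
)

counts = categories.value_counts()                # absolute counts per category
shares = categories.value_counts(normalize=True)  # fractions that sum to 1

# Both Series share the same index, so they align by category label
summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)
```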
Let's look at who the awards were given to - individuals or organisations?
# Laureate Counts
nobel['laureate_type'].value_counts()
The majority of the awards are given to individuals. Now let's look at the top 10 cities where laureates were born.
# Top 10 Cities where Laureates are from
nobel['birth_city'].value_counts().head(10)
We also want to know which are the top 10 organisations where laureates are from.
# Top 10 Organisations where Laureates are from
nobel['organization_name'].value_counts().head(10)
And finally, it would be nice to know which countries the laureates are from and what the top 10 birth countries are.
# Top 10 countries of the organisations where laureates are from
nobel['organization_country'].value_counts().head(10)
# Display the number of prizes won by the top 10 birth countries.
nobel['birth_country'].value_counts().head(10)
Let's try to summarise these findings in one visualisation. See below.
plt.rcParams["figure.figsize"] = (20,20)
# number of prizes won by the top 10 nationalities.
plt.subplot(4, 2, 1)
nobel['birth_country'].value_counts().head(10).plot(kind = 'bar')
plt.xlabel("Birth Country", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Birth Country of Laureates", y=1.02)
# Top 10 Country where Laureates are from
plt.subplot(4, 2, 2)
nobel['organization_country'].value_counts().head(10).plot(kind = 'bar')
plt.xlabel("Country", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Country", y=1.02)
# Top 10 Organisations where Laureates are from
plt.subplot(4, 2, 3)
nobel['organization_name'].value_counts().head(10).plot(kind = 'bar')
plt.xlabel("Organisation Name", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Organisation", y=1.02)
# Top 10 Cities where Laureates are from
plt.subplot(4, 2, 4)
nobel['birth_city'].value_counts().head(10).plot(kind = 'bar')
plt.xlabel("Birth City", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Birth City of Laureates", y=1.02)
# Show Plots
plt.tight_layout()
plt.show();
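The four panels above repeat the same plotting steps, so they could also be drawn in a loop over a 2 x 2 grid. The sketch below is one possible way to do this, not the layout used above. The toy frame is made up, only its column names mirror the data set, and the Agg backend line is there so the sketch also runs as a plain script (omit it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script runs; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; only the column names mirror the data set
toy = pd.DataFrame({
    "birth_country": ["USA", "USA", "Germany", "France"],
    "organization_country": ["USA", "USA", "USA", "France"],
    "organization_name": ["Univ A", "Univ A", "Univ B", "Univ C"],
    "birth_city": ["New York", "Paris", "Berlin", "New York"],
})

# (column, axis label) pairs, one per panel
panels = [
    ("birth_country", "Birth Country"),
    ("organization_country", "Country"),
    ("organization_name", "Organisation Name"),
    ("birth_city", "Birth City"),
]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, (col, label) in zip(axes.flat, panels):
    # Same top-10 bar chart for every column, drawn on its own axes
    toy[col].value_counts().head(10).plot(kind="bar", ax=ax)
    ax.set_xlabel(label)
    ax.set_ylabel("Count")
    ax.set_title(f"Count of Prizes by {label}")
fig.tight_layout()
```

A grid that matches the number of panels avoids the empty rows that a 4 x 2 layout leaves when only four plots are drawn.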
We see clear USA dominance. This is perhaps not so surprising, as many of the well-known "mainstream" researchers and scientists were from America. So, given that the most common Nobel laureate between 1901 and 2016 was an individual born in the United States of America, was this always the case?
# Split the laureates into two periods: 1901 to 1959 and 1960 to 2016
cols = ["year", "category", "birth_country", "sex", "full_name"]
first_half = nobel.loc[nobel.year <= 1959, cols]
last_half = nobel.loc[nobel.year > 1959, cols]
print("\n \n First Half Century Birth Country Counts: \n \n",first_half.birth_country.value_counts().head(10))
print("\n \n Last Half Century Birth Country Counts: \n \n",last_half.birth_country.value_counts().head(10))
# Country winners from 1901 to 1959 (1959 included)
first_half.birth_country.value_counts().head(10).plot(kind = 'bar', figsize = (15,5))
plt.xlabel("Country", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Birth Country, 1901 to 1959", y=1.02)
plt.show()
# Country winners from 1960 to 2016
last_half.birth_country.value_counts().head(10).plot(kind = 'bar', figsize = (15,5))
plt.xlabel("Country", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Count of Prizes by Birth Country, 1960 to 2016", y=1.02)
plt.show();
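The two periods can also be compared side by side in a single table. The following is a sketch under stated assumptions: the rows are made up, the period labels are names chosen here for illustration, and pd.crosstab is swapped in for the two separate value_counts calls used above.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "year": [1905, 1930, 1959, 1960, 1985, 2010],
    "birth_country": ["Germany", "France", "Germany", "USA", "USA", "Japan"],
})

# Same cut-off as in the analysis: 1959 and earlier versus 1960 and later
toy["period"] = np.where(toy["year"] <= 1959, "1901-1959", "1960-2016")

# Rows are birth countries, columns are periods, cells are prize counts
by_period = pd.crosstab(toy["birth_country"], toy["period"])
print(by_period)
```

A country that is absent from one period shows up with a count of 0, which makes the shift between periods easy to read.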
Splitting the years roughly in half and comparing the first 59 years of the Nobel Prize (1901 to 1959) with the later period (1960 to 2016), we see that European countries were far stronger, in terms of the number of laureates, in the earlier period than in more recent times. So when did the USA actually start to dominate the Nobel Prize charts?
# Calculating the proportion of the USA born winners per decade
nobel['usa_born_winner'] = nobel['birth_country']=="United States of America" # add column (boolean) for USA BORN WINNER
nobel['decade'] = (np.floor(nobel['year'] / 10) * 10).astype(int) # add column for decade which award was given
prop_usa_winners = nobel.groupby('decade', as_index=False)['usa_born_winner'].mean()
prop_usa_winners_years = nobel.groupby('year', as_index=False)['usa_born_winner'].mean()
# Setting the plotting theme
sns.set()
# setting the size of all plots.
plt.rcParams['figure.figsize'] = [11, 7]
# Plotting USA born winners
ax = sns.lineplot(x=prop_usa_winners['decade'], y=prop_usa_winners['usa_born_winner'])
ax.set_title("Proportion of USA Born Winners by Decade")
ax.set_xlabel("Decade")
ax.set_ylabel("Proportion of USA Born Winner")
# Adding %-formatting to the y-axis
from matplotlib.ticker import PercentFormatter
ax.yaxis.set_major_formatter(PercentFormatter(1.0));
# Plotting USA born winners yearly
ax = sns.lineplot(x=prop_usa_winners_years['year'], y=prop_usa_winners_years['usa_born_winner'])
ax.set_title("Proportion of USA Born Winners by Year")
ax.set_xlabel("Year")
ax.set_ylabel("Proportion of USA Born Winner");
We clearly identify an increasing trend here. The proportion of USA-born winners grows steadily at the beginning of the 20th century, between 1900 and 1920. It then increases rapidly until about 1940, after which we see steadier growth. It peaked in the decade starting in 2000, when over 40% of winners were born in the USA.
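To make the decade calculation concrete, here is a small worked example on made-up rows. It uses integer floor division, which gives the same decades as the np.floor expression in the analysis, and relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "year": [1901, 1909, 1935, 1938, 2001, 2009],
    "birth_country": [
        "Germany", "France",
        "United States of America", "Germany",
        "United States of America", "United States of America",
    ],
})

toy["usa_born_winner"] = toy["birth_country"] == "United States of America"
toy["decade"] = (toy["year"] // 10) * 10  # e.g. 1938 -> 1930

# Mean of a boolean column = share of True values within each decade
prop = toy.groupby("decade", as_index=False)["usa_born_winner"].mean()
print(prop)
```

In this toy frame the share is 0.0 for the 1900s, 0.5 for the 1930s, and 1.0 for the 2000s.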
As established in the previous section, the USA became the dominant winner of the Nobel Prize in the 1930s and has kept the leading position ever since. An interesting path to follow in the analysis is to explore the distribution of sex amongst the laureates from 1901 to 2016.
# Display the number of prizes won by male and female recipients
nobel['sex'].value_counts()
We see that the vast majority (over 90%) of the winners are male. A mere 5.54% (49 individuals out of 885) are female. We shall continue to explore this imbalance, and investigate whether it is better or worse within specific prize categories like physics, medicine, literature, etc.
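As a quick check of the arithmetic, the sketch below recomputes the female share from the two counts quoted above (49 female laureates among 885 individuals). The Series is built by hand for illustration rather than read from nobel.csv.

```python
import pandas as pd

# Counts taken from the text: 49 female laureates out of 885 individuals
sex_counts = pd.Series({"Male": 885 - 49, "Female": 49})

# Each count divided by the total gives the share of that group
shares = sex_counts / sex_counts.sum()
female_pct = round(shares["Female"] * 100, 2)
print(female_pct)  # about 5.54
```

Note that dividing 49 by the 836 male laureates instead of by the total of 885 gives roughly 5.86%, which is the female-to-male ratio rather than the female share.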
# Calculating the proportion of female laureates per decade
nobel['female_winner'] = nobel['sex']=='Female' # add new column for if winner was female
prop_female_winners = nobel.groupby(['decade','category'], as_index=False)['female_winner'].mean() #new variable for proportion of female winner per decade
# Plotting the proportion of female winners per category, with % on the y-axis
ax = sns.lineplot(x='decade', y='female_winner', hue='category', data=prop_female_winners)
ax.yaxis.set_major_formatter(PercentFormatter(1.0));
This line plot shows some interesting trends and patterns. Most awards to women, particularly in the earlier years, were for literature; this is especially visible between the 1920s and 1940s. We also see literature pick up again for women from the 1980s. Overall the imbalance is pretty large, with physics, economics, and chemistry having the largest imbalance. The imbalance also appears to be quite large in the most recent years, where there is a large disparity between categories in the proportion of awards going to women. Medicine has a somewhat positive trend, and since the 1990s the literature prize is also more balanced. The big outlier is the peace prize during the 2010s, but keep in mind that this just covers the years 2010 to 2016.
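The same proportions can be read more easily as a table with one row per decade and one column per category. The sketch below illustrates this on made-up rows, adding an unstack step that the analysis above does not use.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "decade":   [1920, 1920, 1920, 2000, 2000, 2000],
    "category": ["Literature", "Literature", "Physics",
                 "Literature", "Physics", "Physics"],
    "sex":      ["Female", "Male", "Male", "Female", "Male", "Male"],
})
toy["female_winner"] = toy["sex"] == "Female"

# Proportion of female winners per (decade, category), then categories as columns
table = (
    toy.groupby(["decade", "category"])["female_winner"]
    .mean()
    .unstack("category")
)
print(table)
```

Cells for decade and category combinations with no prizes would appear as NaN, which distinguishes "no prizes awarded" from "no female winners".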
The exact count of female winners in each category is given below.
# Number of Female Winners in each Category
female_winners = nobel[["year", "category","full_name"]][nobel.female_winner == True]
female_winners.category.value_counts()
For most scientists, writers, and activists a Nobel Prize would be the crowning achievement of a long career. But for some people, one is just not enough, and a few have received it more than once. Who are these lucky few?
# Selecting the laureates that have received 2 or more prizes.
nobel.groupby('full_name').filter(lambda group: len(group) >= 2)
# People who received more than one Nobel
repeated_awards = nobel['full_name'].value_counts()
repeated_awards[repeated_awards>=2]
The list of repeat winners contains some illustrious names. We see that 4 distinct individuals have won the award twice. Most notably, Marie Curie achieved this unprecedented feat with the physics prize in 1903 for her research on radioactivity and the chemistry prize in 1911 for discovering radium and polonium. John Bardeen got it twice in physics for transistors and superconductivity, Frederick Sanger got it twice in chemistry, and Linus Carl Pauling got it first in chemistry and later in peace for his work in promoting nuclear disarmament.
We also learn that organizations receive the prize more than once: the UNHCR has received it twice and the Red Cross three times.
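A compact way to list repeat winners together with their categories is sketched below. It is an alternative to the two approaches above, not a replacement. The toy frame is typed in by hand, and Series.duplicated(keep=False) is used to mark every row whose name occurs more than once.

```python
import pandas as pd

# Hand-typed toy rows for illustration
toy = pd.DataFrame({
    "year": [1903, 1911, 1921, 1956, 1972],
    "category": ["Physics", "Chemistry", "Physics", "Physics", "Physics"],
    "full_name": ["Marie Curie", "Marie Curie", "Albert Einstein",
                  "John Bardeen", "John Bardeen"],
})

# keep=False flags all occurrences of a repeated name, not just the later ones
repeat_rows = toy[toy["full_name"].duplicated(keep=False)]

# One row per repeat winner: number of prizes and the categories won
summary = repeat_rows.groupby("full_name")["category"].agg(
    n_prizes="count", categories=", ".join
)
print(summary)
```

Matching on full_name assumes that the same laureate is always spelled identically; matching on an identifier column would be more robust if one is available.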
Now we turn our attention to age, to investigate how old laureates generally are when they receive the prize.
# Converting birth_date to datetime (has no effect if it was already converted above)
nobel['birth_date'] = pd.to_datetime(nobel['birth_date'])
# Calculating the age of Nobel Prize winners
nobel['age'] = nobel['year'] - nobel['birth_date'].dt.year
# Plotting the age of Nobel Prize winners
ax = sns.lmplot(x='year', y='age', data=nobel, lowess=True, aspect=2, line_kws={'color' : 'black'});
From the plot, we see that people used to be around 55 when they received the prize, but nowadays the average is closer to 65. These values are indicated by the black line running through the scatter of points (a LOWESS smoother). But there is a large spread in the laureates' ages, and while most are 50+, some are very young.
We also see that the density of points is much higher nowadays than in the early 1900s: many more of the prizes are now shared, and so there are many more winners.
We also see that there was a disruption in awarded prizes around the Second World War (1939 - 1945).
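The age used in these plots is the award year minus the birth year, which is an approximation because it ignores where the birthday falls within the year. The sketch below illustrates the calculation on a few hand-typed rows (not read from nobel.csv) and shows that rows without a birth date, such as organisations, end up with a missing age.

```python
import pandas as pd

# Hand-typed toy rows for illustration
toy = pd.DataFrame({
    "year": [1915, 2007, 2014, 1917],
    "full_name": ["William Lawrence Bragg", "Leonid Hurwicz",
                  "Malala Yousafzai", "International Committee of the Red Cross"],
    "birth_date": ["1890-03-31", "1917-08-21", "1997-07-12", None],
})
toy["birth_date"] = pd.to_datetime(toy["birth_date"])

# Award year minus birth year; NaT birth dates give NaN ages
toy["age"] = toy["year"] - toy["birth_date"].dt.year

print(toy[["full_name", "age"]])
print(toy["age"].describe())  # summary statistics skip the missing age
```

Because missing ages are skipped by describe() and by the plots, organisations do not distort the age trends.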
Next we shall explore the age trends within different prize categories.
# Same plot as above, but separate plots for each type of Nobel Prize
sns.lmplot(x='year', y='age', data=nobel, row='category' ,lowess=True, aspect=2, line_kws={'color' : 'black'});
From the rows of plots, we see that winners of the chemistry, medicine, and physics prizes have gotten older over time. The trend is strongest for physics: the average age used to be below 50, and now it's almost 70. Literature and economics are more stable. We also see that economics is a newer category. But peace shows an opposite trend where winners are getting younger! In the peace category we also see a winner around 2010 who seems exceptionally young.
The first prize in economic sciences was awarded to Ragnar Frisch and Jan Tinbergen in 1969, which is why the data for that category only starts from that year.
# The oldest winner of a Nobel Prize as of 2016
display(nobel.nlargest(1, 'age'))
# The youngest winner of a Nobel Prize as of 2016
display(nobel.nsmallest(1, 'age'))
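As a small illustration of how the oldest and youngest rows are selected when some ages are missing, consider the sketch below. The names and ages are hypothetical placeholders, and idxmax/idxmin are shown as an alternative to nlargest/nsmallest.

```python
import pandas as pd

# Hypothetical placeholder rows; the organisation has no age
toy = pd.DataFrame({
    "full_name": ["Laureate A", "Laureate B", "Organisation C", "Laureate D"],
    "age": [17.0, 90.0, float("nan"), 55.0],
})

# idxmax/idxmin skip NaN by default, so rows without an age are ignored
oldest = toy.loc[toy["age"].idxmax()]
youngest = toy.loc[toy["age"].idxmin()]
print(oldest["full_name"], youngest["full_name"])

# nlargest with keep="all" returns every row tied for the top age
print(toy.nlargest(1, "age", keep="all"))
```

With keep="all", two laureates of the same maximum age would both be returned, whereas the default keeps only the first.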